This replication archive contains three main folders, the contents of which are described below:
1. data
2. code
3. results
All contents of this replication archive should be kept in their current directory structure so that the R scripts can access the necessary files. The R packages required are 'dplyr', 'gbm', 'ggplot2', 'gridExtra', and 'pROC'. Runtimes for the R scripts can be found at the end of this document.
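
As a convenience, the following minimal sketch checks whether the five packages listed above are installed before any script is run. It is not part of the archive's code; the variable names are illustrative, and the install line is left commented out so that nothing is installed without the user's consent.

```r
# Packages listed in this readme as required by the R scripts.
required_pkgs <- c("dplyr", "gbm", "ggplot2", "gridExtra", "pROC")

# Identify which of them are not yet installed in the current R library.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(required_pkgs, installed)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```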


(1) The 'data' folder contains four data files (in csv format) and the original study's readme file (readme_dresselfarid2018.txt), all downloaded directly from the replication materials offered by the authors of the original study at: http://www.cs.dartmouth.edu/farid/downloads/publications/scienceadvances17/

No modifications have been made to the original BROWARD_CLEAN.csv (containing the Broward County defendant data) and CHARGE_ID.csv (containing data on the charge_id variable). However, for the MTURK_NO_RACE.csv and MTURK_RACE.csv data files (each containing the MTurkers' evaluations of the defendant profiles), a minor modification has been made to allow for statistical analysis: an extraneous row (the 2nd row in the original csv files) has been deleted.
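
For readers who want to reproduce this modification from the original downloads, the following is a minimal sketch of one way to drop the extraneous row in R. It is not the procedure actually used to prepare the archive. It assumes that "2nd row" counts the header as row 1, and it runs on a made-up toy file rather than on the real MTurk data.

```r
# Toy stand-in for an original MTurk file; the contents here are made up.
raw_path <- tempfile(fileext = ".csv")
writeLines(c("id,q1,q2",
             "extraneous,label,row",
             "1,0,1",
             "2,1,1"), raw_path)

# Read the file as plain text and drop the 2nd line
# (assumption: line 1 is the header and line 2 is the extraneous row).
raw_lines <- readLines(raw_path)
clean_lines <- raw_lines[-2]

clean_path <- tempfile(fileext = ".csv")
writeLines(clean_lines, clean_path)

# With the extra row gone, the toy columns are read as numeric.
mturk <- read.csv(clean_path)
str(mturk)
```

Without this step, the non-numeric entries in the extra row would cause read.csv to treat the affected columns as character data, which is presumably why the row had to be removed before statistical analysis.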

Details on these data files can be found in readme_dresselfarid2018.txt.


(2) The 'code' folder contains three subfolders.

The first two subfolders, 'models' and 'models_dummy', contain R scripts that perform the modeling processes for the statistical learning procedures investigated in the study. The contents for these two subfolders are analogous.

In the 'models' subfolder, the logit_model_sample.R, logit_model_boot.R, gbm_model_sample.R, and gbm_model_boot.R files are R scripts that each perform the 1000 prediction iterations for one combination of the two statistical modeling procedures (logistic regression and stochastic gradient boosted trees) and the two approaches to modeling uncertainty (sample and bootstrap), as described in the main text of the study and the Supplementary Materials (SM). Each of these scripts produces Rdata files containing the raw results (specifically, predicted probabilities for each test set across the 1000 iterations), which are stored in the 'results' folder (described below) and then analyzed by separate scripts (also described below) to produce the study's reported results.

The 'models_dummy' subfolder has the same structure as the 'models' subfolder, with analogous scripts that perform analogous modeling processes and produce analogous raw results. The difference is that the scripts in 'models_dummy' use an alternative pre-processing of the charge_id predictor (crime charge), which is a high-cardinality categorical variable. Specifically, the 'models_dummy' scripts perform all modeling after the charge_id predictor has been converted into dummy indicators for each category. See the SM for additional details on the treatment of the charge_id predictor in both the default and alternative settings.
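
To illustrate what converting a categorical predictor into dummy indicators means, the following minimal sketch uses base R's model.matrix on made-up toy data. The scripts in 'models_dummy' may implement the conversion differently; the data frame, its values, and the object names below are hypothetical.

```r
# Toy data; the charge_id values below are made up for illustration.
toy <- data.frame(
  age       = c(23, 41, 35, 29),
  charge_id = factor(c("A", "B", "A", "C"))
)

# One indicator column per category (the "- 1" drops the intercept so that
# no category is absorbed as the reference level).
dummies <- model.matrix(~ charge_id - 1, data = toy)

# Replace the single factor column with its indicator columns.
toy_dummy <- cbind(toy["age"], as.data.frame(dummies))
print(toy_dummy)
```

Each row has exactly one indicator equal to 1, so a variable with many categories expands into many sparse columns.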

Note: The raw results produced by the scripts described above (i.e. the Rdata files) are already contained within the 'results' folder in this replication archive, so the modeling scripts described above need not be re-run to perform the analyses.
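
For users who wish to inspect an Rdata file before running the analysis scripts, the following minimal sketch shows one way to load such a file into a separate environment and list the objects it contains. The file created here is a temporary stand-in; the object name pred_probs and its dimensions are hypothetical and do not describe the actual contents of the files in 'results'.

```r
# Stand-in for a raw results file; object name and shape are hypothetical.
pred_probs <- matrix(runif(20), nrow = 5, ncol = 4)
rdata_path <- tempfile(fileext = ".Rdata")
save(pred_probs, file = rdata_path)

# Load into a separate environment so nothing in the workspace is overwritten.
res_env <- new.env()
loaded_names <- load(rdata_path, envir = res_env)

# load() returns the names of the restored objects.
print(loaded_names)
str(res_env$pred_probs)
```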

The 'analysis_and_results' subfolder contains all code files required to analyze the raw outputs produced by the scripts described above and replicate all results reported in the study. 

The analysis_main.R script is a flexible script that can be used to produce any specific result reported in the study. It is organized such that the user can choose the settings for the analysis, namely (a) the approach to modeling uncertainty, (b) the statistical learning method, (c) whether the default or alternative processing of the crime charge variable is employed, (d) the MTurker data corresponding to the survey wave with race either presented or not presented, and (e) whether to include all defendants or subset to a specific racial group. The desired settings can be selected by simply commenting in/out the appropriate parameters, which are written to be self-explanatory, thereby allowing the user to replicate the results for any specific setting.

The 'analysis_and_results' subfolder also contains two further subfolders, 'build_figures' and 'build_tables', each of which contains standalone R scripts, one for each figure and table reported in the study (both in the main text and SM). All scripts in the 'build_figures' and 'build_tables' subfolders should be run without modification.


(3) The 'results' folder contains four subfolders. 

The first two subfolders, 'sims' and 'sims_dummy', contain the raw results produced by the modeling R scripts. The 'sims' subfolder contains the raw results produced by the modeling R scripts contained in 'models', while the 'sims_dummy' subfolder contains the raw results produced by the modeling R scripts contained in 'models_dummy'. Each of these subfolders is organized so that the analysis scripts can locate the appropriate raw results, and its internal structure should therefore not be changed. The 'sims' and 'sims_dummy' subfolders already contain all raw results required for producing all analyses, figures, and tables.

The 'results' folder also includes 'figures' and 'tables' subfolders, which contain all figures and tables produced by the scripts within the 'build_figures' and 'build_tables' subfolders (i.e. all figures and tables presented in the study, both main text and SM).


(*) Approximate runtimes for R scripts on a standard laptop (Intel Core i7-6500U @ 2.50 GHz, 8.00 GB RAM, Windows 10 64-bit) can be found below.


Modeling:

code/models/gbm_model_boot.R: ~ 2 hours
code/models/gbm_model_sample.R: ~ 20 hours
code/models/logit_model_boot.R: ~ 30 minutes
code/models/logit_model_sample.R: ~ 30 minutes

code/models_dummy/gbm_model_boot_dummy.R: ~ 6 hours
code/models_dummy/gbm_model_sample_dummy.R: ~ 20 hours
code/models_dummy/logit_model_boot_dummy.R: ~ 30 minutes
code/models_dummy/logit_model_sample_dummy.R: ~ 30 minutes


Analysis and Results:

code/analysis_and_results/analysis_main.R: ~ 2 minutes

code/analysis_and_results/build_figures/figure1_main_text.R: < 1 minute
code/analysis_and_results/build_figures/figure2_main_text.R: < 1 minute
code/analysis_and_results/build_figures/figure3_main_text.R: < 1 minute
code/analysis_and_results/build_figures/figureS1_SM.R: < 1 minute
code/analysis_and_results/build_figures/figureS2_SM.R: < 1 minute
code/analysis_and_results/build_figures/figureS3_SM.R: < 1 minute
code/analysis_and_results/build_figures/figureS4_SM.R: < 1 minute

code/analysis_and_results/build_tables/table1_main_text.R: ~ 1 minute
code/analysis_and_results/build_tables/tableS1_SM.R: ~ 1 minute
code/analysis_and_results/build_tables/tableS2_SM.R: ~ 3 minutes
code/analysis_and_results/build_tables/tableS3_SM.R: ~ 1 minute
code/analysis_and_results/build_tables/tableS4_SM.R: ~ 1 minute
code/analysis_and_results/build_tables/tableS5_SM.R: ~ 3 minutes
code/analysis_and_results/build_tables/tableS6_SM.R: ~ 3 minutes